1. Introduction

The main reason for this project is to demonstrate ability to use R to import, clean, transform and present data. By applying our newly gained knowledge for our selected data sets, we are aiming to use it for analysis, model and chart creation for better understanding of data.

For this purpose we focused on the prices of Airbnb listings, burglary rate and average apartment rent prices in the districts of the city of Zurich in 2022.

The main goal is to find out how burglary rate and neighborhood group influences Airbnb rental prices in the districts of Zurich.

For the analysis we used following variables:

2. Libraries

library(tidyverse) # data cleaning, sorting, structuring and preparation
library(plotly) # interactive data vizualisation

3. Data Collection and Import

# Be aware of working directory to import data sets correctly
getwd()

4. Data Cleaning and Preparation

4.1. Data set of Listings of Airbnb

Firstly, all quarterly data is merged together.

tmp_ab_combained <- rbind(ab_listing_0922, 
                          ab_listing_1222,
                          ab_listing_0323, 
                          ab_listing_0623)

After merging quarterly data in one data set, we select only the columns of our interest.

tmp_ab_combained_1 <- tmp_ab_combained %>% 
  select(id, name, neighbourhood_group, neighbourhood, latitude, 
         longitude, room_type, price)

Reviewing combined data frame to inspect for the correct variable classifications and spot any data errors or inconsistencies.

str(tmp_ab_combained_1)
## 'data.frame':    8943 obs. of  8 variables:
##  $ id                 : num  73282 143821 178448 204586 222565 ...
##  $ name               : chr  "Clean, central, quiet" "marvelous LOFT in SIHLCITY Zürich" "a lovely place, top location" "very nice luxury city apartment" ...
##  $ neighbourhood_group: chr  "Kreis 3" "Kreis 3" "Kreis 2" "Kreis 10" ...
##  $ neighbourhood      : chr  "Sihlfeld" "Alt-Wiedikon" "Enge" "Höngg" ...
##  $ latitude           : num  47.4 47.4 47.4 47.4 47.3 ...
##  $ longitude          : num  8.52 8.52 8.53 8.48 8.54 ...
##  $ room_type          : chr  "Entire home/apt" "Entire home/apt" "Private room" "Private room" ...
##  $ price              : int  100 200 60 200 75 650 148 72 159 210 ...

Since we combined 4 data sets of Airbnb listings, it is quite likely, that in the past year there were listings, which appeared in all quarterly lists for the past year. They must be detected by listings id and eliminated for further analysis.

tmp_ab_no_duplicates <- tmp_ab_combained_1 %>% 
  distinct(id, .keep_all = TRUE)

Data set contains little information about hotel and shared rooms. In order to make our data for analysis more consistent, we remove these types of listings from our final cleaned data set.

room_type_table <-  table(tmp_ab_no_duplicates$room_type)
room_type_table

ab_final <- tmp_ab_no_duplicates %>% 
  filter(room_type %in% c("Entire home/apt", "Private room"))

head(ab_final)
unique(ab_final$room_type)

# checking the data if there is some missing values
anyNA(ab_final)

Quick final glimpse of cleaned data set.

head(ab_final, n = 1)
##      id                  name neighbourhood_group neighbourhood latitude
## 1 73282 Clean, central, quiet             Kreis 3      Sihlfeld 47.37374
##   longitude       room_type price
## 1   8.51957 Entire home/apt   100

4.2. Data set of burglary rate in Zurich

Reviewing structure of data frame to inspect for the correct variable classifications and spot any data errors or inconsistencies.

str(burglary)
## 'data.frame':    7350 obs. of  13 variables:
##  $ Ausgangsjahr        : int  2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
##  $ Gemeinde_BFS_Nr     : int  21 21 21 131 131 131 241 241 241 1 ...
##  $ Gemeindename        : chr  "Adlikon" "Adlikon" "Adlikon" "Adliswil" ...
##  $ Stadtkreis_BFS_Nr   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Stadtkreis_Name     : chr  "" "" "" "" ...
##  $ Gesetz_Nummer       : int  311 311 311 311 311 311 311 311 311 311 ...
##  $ Gesetz_Abk          : chr  "StGB" "StGB" "StGB" "StGB" ...
##  $ Tatbestand          : chr  "Einbruchdiebstahl" "Einschleichdiebstahl" "Einbrüche insgesamt" "Einbruchdiebstahl" ...
##  $ Straftaten_total    : int  7 0 7 159 33 192 10 2 12 3 ...
##  $ Straftaten_vollendet: int  6 0 6 114 32 146 7 2 9 3 ...
##  $ Straftaten_versucht : int  1 0 1 45 1 46 3 0 3 0 ...
##  $ Einwohner           : int  569 569 569 16052 16052 16052 987 987 987 1700 ...
##  $ Häufigkeitszahl     : num  12.3 0 12.3 9.9 2.1 12 10.1 2 12.2 1.8 ...

Filtering for year 2022, city of Zurich, which municipality’s number is 261 and category of total crime in the district.

burglary_final <- burglary %>% 
  select(Year, MunicipalityBFSNr, DistrictName, FactsEnglish, 
         OffencesTotal, Residents) %>% 
  filter(Year == 2022, 
         MunicipalityBFSNr == 261, 
         FactsEnglish == "TotalCrime", 
         grepl("^Kreis", DistrictName) ) %>% 
  mutate("BurglariesPerResident" = OffencesTotal/Residents) %>% 
  arrange(desc(BurglariesPerResident))

Quick final glimpse of cleaned data set.

head(burglary_final, n = 1)
##   MunicipalityBFSNr DistrictName FactsEnglish OffencesTotal Residents
## 1               261      Kreis 1   TotalCrime           153      5411
##   BurglariesPerResident
## 1            0.02827573

4.3. Data set of apartment rent rate in Zurich

Reviewing structure of data frame to inspect for the correct variable classifications and spot any data errors or inconsistencies.

str(rent)

Selecting only the column that we need for our analysis.

rent_final <- rent %>%
  select(RaumeinheitLang, GliederungLang, ZimmerSort, GemeinnuetzigLang, 
         EinheitLang, PreisartLang, mean) %>% 
  filter(grepl("^Kreis", GliederungLang),
         GemeinnuetzigLang == "Alle Wohnungen",
         RaumeinheitLang == "Stadtkreise", 
         PreisartLang == "Netto", 
         EinheitLang == "Wohnung") %>% 
  group_by(GliederungLang) %>% 
  summarise(avg.price = round(mean(mean), 0)) %>% 
  rename(neighbourhood_group = GliederungLang) %>% 
  arrange(desc(avg.price))

Quick final glimpse of cleaned data set.

head(rent_final, n = 1)
## # A tibble: 1 × 2
##   neighbourhood_group avg.price
##   <chr>                   <dbl>
## 1 Kreis 1                  2317

4.4. Joining cleaned data sets in master data set

We have three cleaned and trimmed data sets. Left_join function is used to merge them together by the name of the district.

master_final <- ab_final %>% left_join(burglary_final, 
                                     by = "neighbourhood_group") %>% 
  left_join(rent_final, by = "neighbourhood_group")
# write.csv(master.final, file = "master.final.csv", row.names = FALSE) #NEW

# Quick check 
head(master_final)
##       id                                             name neighbourhood_group
## 1  73282                            Clean, central, quiet             Kreis 3
## 2 143821                marvelous LOFT in SIHLCITY Zürich             Kreis 3
## 3 178448                     a lovely place, top location             Kreis 2
## 4 204586                  very nice luxury city apartment            Kreis 10
## 5 222565                Bedroom overlooking the lake near             Kreis 2
## 6 227039 *Luxury Penthouse in the heart of trendy Zurich*             Kreis 5
##   neighbourhood latitude longitude       room_type price MunicipalityBFSNr
## 1      Sihlfeld 47.37374   8.51957 Entire home/apt   100               261
## 2  Alt-Wiedikon 47.35724   8.52304 Entire home/apt   200               261
## 3          Enge 47.36565   8.52753    Private room    60               261
## 4         Höngg 47.40656   8.48465    Private room   200               261
## 5   Wollishofen 47.33463   8.54117    Private room    75               261
## 6   Escher Wyss 47.38942   8.51881 Entire home/apt   650               261
##   FactsEnglish OffencesTotal Residents BurglariesPerResident avg.price
## 1   TotalCrime           304     48865           0.006221222      1564
## 2   TotalCrime           304     48865           0.006221222      1564
## 3   TotalCrime           251     35494           0.007071618      1803
## 4   TotalCrime           180     39719           0.004531836      1614
## 5   TotalCrime           251     35494           0.007071618      1803
## 6   TotalCrime           194     15331           0.012654100      1801

5. Data Visualisation

5.1 Price distribution agains district

The violin plot below shows the relationship of neighborhood group to price per night in CHF for two groups entire home/apartment and private room. For the group entire home/apartment the shape of Kries 3,7,8 ( skinny on each end and wide in the middle) indicates the price of this type of accommodation located around the median. The box plot elements show the median of Kries 12 is lower than for another neighborhood group. Regarding the private room, we have a distribution shape around the median for Kries 2,4,6,10. Kries 10 has the lowest median price. Focusing on the median value, entire home/apartment has a higher median than private room for all neighborhood groups. This is because entire home/apartment are generally more expensive than private room. Also in this case, we use the libraries “ggplot2” and “plotly”. In addition, the type of graph that we’ve selected is the box plot and violin plot. The reason that shows us the distribution of numerical data and skewness. In particular, with the interactive graph, it’s possible to visualize the median, quartiles, and outliers.

5.2 Percentage of room type as a function of districts

Across all districts, the dominant type of accommodation offered on Airbnb is entire home/apartment. Kries 3 and 4 possess the highest percentage of entire home/apartment while Kries 12 has the smallest share of such a type of accommodation. On the other hand, Kries 4 offers the most private rooms for rent, while Kries 12 offers the fewest. This observation can be explained by the fact that Kries 3 and 4 may be close to the city center or that this district is located in a touristic accommodation area. However, Kries 12 may be located in the inner city of Zurich; it’s more of a residential area than a tourist area.

5.3 Plot the number of accommodation in every neighborhood

The data in the plot illustrates that Langstrasse has the highest number of accommodations, followed by Altstetten and Sihlfeld. These neighborhoods correspond to Kies 4, 9, and 3, respectively. In the other direction, Saatlen, which belongs to Kries 12, has the lowest number of accommodations.

5.4 Plot burglary rate against district

The burglary rate per neighborhood varies from 0.3 to almost 3 per hundred residents in the area. District number 1 has the highest burglary rate with almost 3 burglaries per 100 residents, which is far above the average burglary rate in other neighborhoods. It could be explained by the fact that in this district the central train station is located, which tends to attract all sorts of crowds, as well as the infamous Langstrasse street, which is packed with special entertainment and activities.

6. Fit model

6.1 Simple linear model

We aim to conduct a basic linear model analysis to assess whether alterations in burglary rates have a substantial impact on property prices. In essence, we seek to determine if changes in price are linked to burglaries.

## 
## Call:
## lm(formula = price ~ BurglariesPerResident, data = less_than_500)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -117.54  -56.73  -20.00   34.72  362.79 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            129.663      2.699  48.043  < 2e-16 ***
## BurglariesPerResident 1339.426    260.358   5.145 2.88e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.13 on 2679 degrees of freedom
## Multiple R-squared:  0.009783,   Adjusted R-squared:  0.009413 
## F-statistic: 26.47 on 1 and 2679 DF,  p-value: 2.875e-07

According to the model, we have a significant relationship between burglary rates and property prices. To rephrase it, it would mean that variations in safety levels, particularly in terms of burglary incidents, have an impact on property pricing.

As the plot shows, the distribution of the point does not follow a linear dispersion. Most likely, aiming for another model will be more adequate.

After all these considerations, we conclude that burglaries Per Resident and price have no relation to causality. Every price depends on other parameters, for example, location and type of accommodation. As a result, the price also depends on it. Moreover, we want to specify an aspect related to the price that will appear frequently in our data. Some prices are very high; they don’t reflect reality, so we decided to remove them.

6.2 ANOVA model

First of all, we want to perform an ANOVA test to evaluate if changing the burglaries and neighborhood group significantly affects the price. In other words, we want to measure if price variation is influenced by location and safety level.

##                         Df    Sum Sq  Mean Sq F value   Pr(>F)    
## (Intercept)              1  96972530 96972530 521.692  < 2e-16 ***
## BurglariesPerResident    1   3282863  3282863  17.661 2.72e-05 ***
## neighbourhood_group     10   2093493   209349   1.126    0.338    
## Residuals             2832 526414770   185881                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary result shows that the ANOVA model is significant for burglaries per resident at a significance level of alpha = 0.05, but not for neighborhood group. In fact, there is a main effect of burglaries per resident on the price, F(1,2832), p = 2.72e-05. However, there is no main effect of neighborhood group on the price, F(1,2832), p = 0.338.

7. Chapter of choice

7.1 Feature extraction with the help of regex

For the first part of chapter of choice, we decided to extract some information from the column “name”, which containes mixed information about the particular listing. Firstly, we spot the pattern, that was repetitive in the column, then with the help of regex we extracted information and enriched our data set with the information of how many bedrooms, beds and bathrooms are in each listing. That was done only with the data of most recent listings data and was not used for further data exploration.

for (i in 1:nrow(ab_listing_0623)){
  row_words <- unlist(str_split(ab_listing_0623[i, 2], " "))  # Split second 
  # column's information by space
  
  bedroom_idx <- which(str_detect(row_words, "^bedr")) # extract words with bedr
  beds_idx <- which(str_detect(row_words, "^bed")) # extract words with bed
  bath_idx <- which(str_detect(row_words, "bath|baths")) # extract words with bath
  
  # Check if bedroom indexes are found and populate newly created row
  if (length(bedroom_idx) > 0){ 
    ab_listing_0623[i, "NrOfBedrooms"] <- as.numeric(row_words[max(1, bedroom_idx - 1)])
    # we know, that the index of number is one less than the index of the word that we
    # are looking for.
  } else {
    ab_listing_0623[i, "NrOfBedrooms"] <- NA
  }
  
  # Check if beds indexes are found and populate newly created row
  if (length(beds_idx) > 0){ 
    ab_listing_0623[i, "NrOfBeds"] <- as.numeric(row_words[max(1, beds_idx - 1)])
    # we know, that the index of number is one less than the index of the word that we
    # are looking for.
  } else {
    ab_listing_0623[i, "NrOfBeds"] <- NA
  }
  
  # Check if bath indexes are found and populate newly created row
  if (length(bath_idx) > 0){ 
    ab_listing_0623[i, "NrOfBaths"] <- as.numeric(row_words[max(1, bath_idx - 1)])
    # we know, that the index of number is one less than the index of the word that we
    # are looking for.
  } else {
    ab_listing_0623[i, "NrOfBaths"] <- NA
  }
}
##      id                                                       name host_id
## 1 73282 Rental unit in Zurich · ★4.78 · 1 bedroom · 1 bed · 1 bath  377532
##   host_name neighbourhood_group neighbourhood latitude longitude
## 1    Simona             Kreis 3      Sihlfeld 47.37167   8.51948
##         room_type price minimum_nights number_of_reviews last_review
## 1 Entire home/apt   100           1125                49  2019-04-27
##   reviews_per_month calculated_host_listings_count availability_365
## 1              0.36                              1              358
##   number_of_reviews_ltm license NrOfBedrooms NrOfBeds NrOfBaths
## 1                     0      NA            1        1         1

7.2 Interactive map with Leaflet and Shiny

For the second part of chapter of our choice, we decided to create interactive map with Leaflet and Shiny packages. Our data sets consist of complex layers of data, which is easier to explore in interactive manner. This small application made with help of Shiny package. Application could be found in the 4_shinyApp folder separate from this document. Application allows the user to compare various information layers of Airbnb listings, burglary rate and average rent prices in each district or between them. Layout of application is chosen to help user to understand the data better.